For web scraping engineers, data cleaning and storage are the final and often most tedious steps in the workflow. In enterprises that scrape thousands of websites, this work frequently becomes a dedicated role: the data cleaning specialist.
Here are the 10 most efficient data cleaning techniques from my daily scraping practice:
1. XPath
XPath is my most frequently used HTML parsing method; mastering it solves over 90% of scraping data cleaning challenges.
Use Case: When scraped data is embedded in HTML code.
Example: Extracting Fortune Global 500 company data (2024 rankings):
import requests
from parsel import Selector

url = "https://www.fortunechina.com/fortune500/c/2024-08/05/content_456697.htm"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
response.encoding = 'utf8'  # the page is served as UTF-8 Chinese text
selector = Selector(text=response.text)

# One <tr> per company; note that /tbody/ only matches if the server-sent HTML
# actually contains a <tbody> (browsers insert one automatically, so an XPath
# copied from DevTools sometimes needs it removed)
companies = selector.xpath('//div[@class="hf-right word-img2"]/div[@class="word-table"]/div[@class="wt-table-wrap"]/table/tbody/tr')
for company in companies:
    rank = company.xpath('./td[1]/text()').get()
    name = company.xpath('./td[2]/a/text()').get()
    revenue = company.xpath('./td[3]/text()').get()
    profit = company.xpath('./td[4]/text()').get()
    country = company.xpath('./td[5]/text()').get()
    print(rank, name, revenue, profit, country)
Key XPath Syntax (demonstrated in the sketch below):
- Node selection: //div, x/div, div/text()
- Predicates: div[1], div[last()], div[@class="example"]
- Axes: ancestor::, following-sibling::
- Fuzzy matching: contains(@href, "example.com")
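A minimal sketch illustrating these constructs, using an invented HTML fragment (the markup and values below are made up purely for demonstration):
from parsel import Selector

# Invented fragment to exercise the syntax listed above
html = '''
<div id="links">
  <a href="https://example.com/a">First</a>
  <a href="https://other.org/b">Second</a>
  <a href="https://example.com/c">Third</a>
</div>
'''
sel = Selector(text=html)
print(sel.xpath('//a[1]/text()').get())                           # predicate: 'First'
print(sel.xpath('//a[last()]/text()').get())                      # predicate: 'Third'
print(sel.xpath('//a[1]/following-sibling::a[1]/text()').get())   # axis: 'Second'
print(sel.xpath('//a[contains(@href, "example.com")]/text()').getall())  # fuzzy: ['First', 'Third']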
2. Pandas read_html
For tabular data in HTML, Pandas offers a one-line solution:
import pandas as pd
from io import StringIO
from sqlalchemy import create_engine

# read_html parses every <table> on the page; [0] takes the first match
df = pd.read_html(StringIO(response.text))[0]
print(df.head())

# Export options (to_excel needs openpyxl; to_sql needs SQLAlchemy plus a MySQL driver)
df.to_excel('fortune500.xlsx')
df.to_sql('fortune500', create_engine('mysql+pymysql://user:pass@localhost/db'))

# Quick analysis
print(df['Country'].value_counts())
Pro Tip: When facing IP blocks, route requests through a proxy; for sustained scraping, rotate between several (see the sketch below):
proxy = "http://user:pass@proxy_ip:port"  # placeholder credentials and address
response = requests.get(url, proxies={'http': proxy, 'https': proxy})
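A single proxy only delays the block; actual rotation cycles through a pool. A minimal sketch, assuming a list of placeholder endpoints:
import itertools
import requests

# Placeholder endpoints; substitute real proxy credentials and addresses
proxy_pool = itertools.cycle([
    "http://user:pass@proxy_ip1:port",
    "http://user:pass@proxy_ip2:port",
])

def fetch(url):
    proxy = next(proxy_pool)  # next proxy in round-robin order
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)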
[Continued in next part…]
These methods form the core toolkit for efficient web data extraction and transformation. The complete code examples are available in the [GitHub repository].